GitHub

The code for the process described below can be freely downloaded from https://github.com/fzenoni/london-accidents.

Baby, you can drive my car

Besides being able to “put dots on a map”, R can be used in very effective ways to do spatial analytics. Last month, Stefano already provided descriptive statistics on a Kaggle dataset that includes 16 years of UK accidents.

It is now time to move up a gear (no pun intended), and tackle one of the many ways of extracting spatial insights from the data. In this post, we are going to ignore the potential analysis brought by the time series. It will certainly be the topic of another post.

Instead, we will try to propose an answer to the following questions: what if the government wanted to highlight the areas of a city showing some alarming characteristics, with a given statistical significance? What if we wanted to discover London’s most dangerous roads or intersections for car drivers?

In fact, London happens to be too large for our purposes. I will only investigate a neighborhood, but the method shown in the following remains valid at any scale.

“80% of the data analyst job”

First things first, we load the data and clean it a bit. The fastest way to do this in memory, while enjoying functions dedicated to tabular data, is still to use the data.table package.

# Packages for fast tabular I/O and the pipe operator
library(data.table)
library(magrittr)

# Load data
set <- list.files(path = '1-6m-accidents-traffic-flow-over-16-years',
                  pattern = 'accidents.*csv', full.names = TRUE)
data <- lapply(set, fread) %>% rbindlist
# Filter out records with missing locations
data <- na.omit(data, cols = c('Longitude', 'Latitude'))
# Remove duplicated records
data <- data[!duplicated(data)]

data.table is nice and all, but since we work with spatial data we’re going to use the sf format, as we did in the past. As sf does not exactly extend data.table, I’m going to cast the table to a data.frame first.

library(sf)
data <- data %>% as.data.frame %>% st_as_sf(coords = c('Longitude', 'Latitude'), crs = 4326)
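
As a quick sanity check (an addition of mine, not part of the original walkthrough), we can confirm that the result is an sf object carrying the WGS84 coordinate reference system we just declared:

# Both checks rely only on base R and sf
class(data)   # should include "sf"
st_crs(data)  # should report EPSG 4326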

This dataset includes accidents over 16 years for the whole of the UK. That’s a lot of records, and performance-wise I don’t necessarily have a strategy in place to analyze them all. As a first move, I’m going to select the events that fall inside the administrative boundaries of London’s boroughs. The Kaggle dataset luckily includes a GeoJSON file of the UK’s districts. I’ve had mixed feelings about this format in the past, but the bad ones were wiped out by the recent discovery of two methods to open it and import it as sf.

The first one is the classic sf::read_sf().

system.time(sf::read_sf('1-6m-accidents-traffic-flow-over-16-years/Local_Authority_Districts_Dec_2016.geojson'))
##    user  system elapsed 
##   7.069   0.592   8.074

But the freshest discovery is the geojsonsf::geojson_sf() function from SymbolixAU (check their blog post here), which serves exactly our purpose in an even faster way.

system.time(map <- geojsonsf::geojson_sf('1-6m-accidents-traffic-flow-over-16-years/Local_Authority_Districts_Dec_2016.geojson'))
##    user  system elapsed 
##   1.819   0.117   2.615
head(map)
## Simple feature collection with 6 features and 10 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -2.832457 ymin: 53.30503 xmax: -0.7884304 ymax: 54.72717
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##    bng_e  bng_n                       geometry   lad16cd
## 1 447157 531476 MULTIPOLYGON (((-1.268456 5... E06000001
## 2 451141 516887 MULTIPOLYGON (((-1.243904 5... E06000002
## 3 464359 519597 MULTIPOLYGON (((-1.137578 5... E06000003
## 4 444937 518183 MULTIPOLYGON (((-1.317286 5... E06000004
## 5 428029 515649 POLYGON ((-1.637678 54.6171... E06000005
## 6 354246 382146 MULTIPOLYGON (((-2.626835 5... E06000006
##                lad16nm lad16nmw      lat     long objectid st_areashape
## 1           Hartlepool          54.67616 -1.27023        1     93559511
## 2        Middlesbrough          54.54467 -1.21099        2     53888581
## 3 Redcar and Cleveland          54.56752 -1.00611        3    244820281
## 4     Stockton-on-Tees          54.55691 -1.30669        4    204962233
## 5           Darlington          54.53535 -1.56835        5    197475689
## 6               Halton          53.33424 -2.68853        6     79084033
##   st_lengthshape
## 1       71707.33
## 2       43840.85
## 3       97993.29
## 4      119581.51
## 5      107206.28
## 6       77770.94

Now I would like to extract London’s boroughs, but I am bored by the idea of inspecting the map and learning new geography. Since I am the laziest member of the team, I decided to harvest this Wikipedia page with the rvest package, and use the resulting list to filter the regions of interest.

# Harvest the list of London boroughs from Wikipedia
wiki_london <- xml2::read_html('https://en.wikipedia.org/wiki/London_boroughs')
boroughs1 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[1]] %>% rvest::html_text()
boroughs2 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[2]] %>% rvest::html_text()

# Split the raw text into one borough name per element
borough_list <- c(unlist(strsplit(boroughs1, '\n')), unlist(strsplit(boroughs2, '\n')))

# Special cases to fix
borough_list <- replace(borough_list, borough_list == 'City of London (not a London borough)', 'City of London')
borough_list <- replace(borough_list, borough_list == 'City of Westminster', 'Westminster')

# Filter map
london <- map %>% dplyr::filter(lad16nm %in% borough_list)
# Unite the boroughs into a single geometry
london_union <- london %>% st_union
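
Before going further, a quick count check (my addition): Greater London consists of 32 boroughs plus the City of London, so the filter should have matched 33 features.

# Expect 33 rows: 32 boroughs + the City of London
nrow(london)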

The London map is ready to be displayed.

plot(london_union)

As mentioned before, I originally wanted to analyze all of London’s data, but it seems that, to get it done within the day, I would need to parallelize part of the analysis code (I have some plans, so maybe it will be a task for a post addendum). Instead, I will analyze the data falling within a radius of 2000 m around London’s centroid. For sf aficionados, this is a trivial task.

# Project to British National Grid (http://epsg.io/27700)
data <- data %>% st_transform(crs = 27700)
london_union <- london_union %>% st_transform(crs = 27700)

# Build a circle
center <- st_centroid(london_union)
max_radius <- 2000
london_circle <- st_buffer(center, max_radius)

# Keep only the accidents that fall inside the circle
london_data <- data[london_circle, ]
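
If the bracket notation looks like magic: subsetting an sf object with another spatial object keeps, by default, the features that intersect it. A more explicit spelling of the same operation would be the following sketch:

# Equivalent to data[london_circle, ]: keep points intersecting the circle
inside <- st_intersects(data, london_circle, sparse = FALSE)[, 1]
london_data <- data[inside, ]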

This is what we got. I know, I know, it’s a small sample!

# Original amount of data
nrow(data)
## [1] 1469894
# Filtered data
nrow(london_data)
## [1] 9205
# Draw the area
plot(london_union)
plot(london_circle, add = TRUE, col = sf.colors(n = 1, alpha = 0.3))

# Display the accidents as points on an interactive map
library(mapview)
mapview(london_data)
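
mapview can also colour the points by an attribute. For instance, this variation (my addition) uses the Number_of_Casualties column that we will rely on later:

# Colour the accident points by the number of casualties
mapview(london_data, zcol = 'Number_of_Casualties')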

Spatial data

We’re now ready to inspect some spatial data. I will split the accidents into two categories. This distinction is highly arbitrary: in order to act on the final results (i.e. to decide where Sadiq Khan should spend taxpayers’ money to improve road safety), more time should be spent understanding the (meta-)data and acquiring domain knowledge. But once again, even with a suboptimal decision at this stage, the method that follows remains valid.

I decided to split the data into Severe and Non-Severe accidents, the former involving more than one casualty, the latter strictly one. To visualize and work on spatial densities, we will use a special kind of object, the point pattern dataset (ppp) from the spatstat package. The package is not (yet) able to deal with sf, so we’re going to hold our noses and go back for a moment to sp.

# spatstat provides the ppp class; maptools provides the
# as.owin() method for sp objects
library(spatstat)
library(maptools)

# Create spatstat ppp object piece by piece
# Define the observation window
london_owin <- as(london_circle, 'Spatial') %>% as.owin.SpatialPolygons()
# Extract accident coordinates
london_coords <- st_coordinates(london_data)
# Build "marks", i.e. the category attached to each point
london_marks <- as.factor(ifelse(london_data$Number_of_Casualties == 1, 'Non-Severe', 'Severe'))
# Define the ppp
london_ppp <- ppp(x = london_coords[, 1], y = london_coords[, 2],
                  window = london_owin,
                  marks = london_marks)
## Warning: data contain duplicated points
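
The warning about duplicated points is expected here: several accidents can be recorded at exactly the same location. If we wanted to get rid of them, one option (my addition, not part of the original workflow) would be to displace coincident points by a tiny random amount with spatstat’s rjitter(); the 1 m radius below is an arbitrary illustrative choice.

# Jitter each point by a uniformly random displacement of up to 1 metre
london_ppp_jittered <- rjitter(london_ppp, radius = 1)
any(duplicated(london_ppp_jittered))  # should now be FALSE

In what follows I keep the duplicates as they are, since repeated accidents at the same spot are arguably genuine information.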

With such an object, we can use spatstat to quickly display useful information. We start by showing the fraction of Severe accidents over the total.

# Split the data according to the "marks"
london_splits <- split(london_ppp)
# Compute densities
accident_densities <- density(london_splits)
# Display the fractional density ([[1]] = Non-Severe, [[2]] = Severe)
frac_severe_accidents <- accident_densities[[2]]/(accident_densities[[1]] + accident_densities[[2]])
plot(frac_severe_accidents)

This plot is cool, but it just tells us that Severe accidents occur in this area at a rate that oscillates between roughly 8% and 11%. Are the highest concentrations meaningful? Are some areas actually more dangerous than others? Or are they simply the result of random fluctuations?

There is one way to find out. The technique I’m going to apply, in the context of point pattern analysis, is called spatial segregation. I am going to estimate the density of accidents, split into our two custom categories, by means of kernel density estimation. A smooth surface is fitted to the data: the highest values correspond to the locations of the points, and the values diminish as the distance from the points increases. However, a crucial parameter must first be determined: the bandwidth. Choose it too small, and the density will peak exactly where the points are and be zero elsewhere; choose it too large, and the density will appear as a featureless blob.
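
To make this trade-off concrete, here is a minimal sketch using spatstat’s density.ppp() with two deliberately extreme bandwidths (the sigma values are arbitrary choices of mine, for illustration only):

# Too small a bandwidth: spikes at the points, near zero elsewhere
# Too large a bandwidth: a featureless blob
par(mfrow = c(1, 2))
plot(density(london_ppp, sigma = 20), main = 'sigma = 20 m')
plot(density(london_ppp, sigma = 2000), main = 'sigma = 2000 m')
par(mfrow = c(1, 1))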

spseg(opt = 1) from the spatialkernel package will assist us in selecting the best bandwidth value for our data by cross-validation.

bw_choice <- spseg(pts = london_ppp,
                   h = seq(300, 500, 20), opt = 1)

The h parameter indicates the values to be tested by the cross-validation algorithm. The best one is then provided as follows.

plotcv(bw_choice); abline(v = bw_choice$hcv, lty = 2, col = "red")
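
The dashed red line marks the selected bandwidth on the cross-validation curve; the value itself is stored in the hcv element of the returned object.

# The cross-validated bandwidth, in metres
bw_choice$hcv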